Visualizing 2019-2020 NBA Season Stats

Introduction: For my R final project, I look at some data from this year’s NBA season using various visualization techniques available through R. I then look at various stats of players and see whether or not there is any correlation between them.

library(plotly)
library(readxl)
library(d3heatmap)
library(viridis)
library(tidyverse)
library(ggfortify)

NBA_Player_Stats_master <- read_excel("C:/Users/mog2/Downloads/2019-2020 NBA Player Stats  NBAstuffer.xlsx", skip = 1)
head(NBA_Player_Stats_master)
## # A tibble: 6 x 29
##   RANK  `FULL NAME` TEAM  POS     AGE    GP   MPG `MIN%Minutes Pe~
##   <lgl> <chr>       <chr> <chr> <dbl> <dbl> <dbl>            <dbl>
## 1 NA    Steven Ada~ Okc   C      26.4    20  27.1             56.4
## 2 NA    Bam Adebayo Mia   C-F    22.4    24  33.5             69.8
## 3 NA    LaMarcus A~ San   F-C    34.4    21  32.9             68.6
## 4 NA    Nickeil Al~ Nor   G      21.3    20  14               29.2
## 5 NA    Grayson Al~ Mem   G      24.2    13  18.9             39.5
## 6 NA    Jarrett Al~ Bro   C      21.6    23  26.5             55.2
## # ... with 21 more variables: `USG%Usage RateUsage rate, a.k.a., usage
## #   percentage is an estimate of the percentage of team plays used by a
## #   player while he was on the floor` <dbl>, `TO%Turnover RateA metric
## #   that estimates the number of turnovers a player commits per 100
## #   possessions` <dbl>, FTA <dbl>, `FT%` <dbl>, `2PA` <dbl>, `2P%` <dbl>,
## #   `3PA` <dbl>, `3P%` <dbl>, `eFG%Effective Shooting PercentageWith eFG%,
## #   three-point shots made are worth 50% more than two-point shots made.
## #   eFG% Formula=(FGM+ (0.5 x 3PM))/FGA` <dbl>, `TS%True Shooting
## #   PercentageTrue shooting percentage is a measure of shooting efficiency
## #   that takes into account field goals, 3-point field goals, and free
## #   throws.` <dbl>, `PPGPointsPoints per game.` <dbl>,
## #   `RPGReboundsRebounds per game.` <dbl>, `TRB%Total Rebound
## #   PercentageTotal rebound percentage is estimated percentage of
## #   available rebounds grabbed by the player while the player is on the
## #   court.` <dbl>, `APGAssistsAssists per game.` <dbl>, `AST%Assist
## #   PercentageAssist percentage is an estimated percentage of teammate
## #   field goals a player assisted while the player is on the court` <dbl>,
## #   `SPGStealsSteals per game.` <dbl>, `BPGBlocksBlocks per game.` <dbl>,
## #   `TOPGTurnoversTurnovers per game.` <dbl>, `VIVersatility
## #   IndexVersatility index is a metric that measures a player’s ability to
## #   produce in points, assists, and rebounds. The average player will
## #   score around a five on the index, while top players score above
## #   10` <dbl>, `ORTGOffensive RatingIndividual offensive rating is the
## #   number of points produced by a player per 100 total individual
## #   possessions.` <dbl>, `DRTGDefensive RatingIndividual defensive rating
## #   estimates how many points the player allowed per 100 possessions he
## #   individually faced while staying on the court.` <dbl>

To tidy the data, I select the columns of information that I want and also change the column names.

NBA_Player_Stats <- NBA_Player_Stats_master %>%
 select(c(2:7,11:16,19,20,22,24:26))

colnames(NBA_Player_Stats)[13:18] <- c("PPG","RPG","APG","SPG","BPG","TOPG")

Currently, the table is arranged with the players and their respective teams, so I wanted to look at the range of points scored by players on each team. To do this, I used ggplot with boxplots and faceted the data by teams.

p <- ggplot(NBA_Player_Stats, aes(x = TEAM, y = `PPG`, fill = TEAM)) + geom_boxplot() + scale_fill_viridis(discrete = TRUE, alpha=0.6) +
    geom_jitter(color="black", size=0.4, alpha=0.9)
ggplotly(p)

I grouped the data by TEAM and added an additional column labeled as total_ppg. I then imported another data table and joined them by team name to get win percentage.

Winning <- NBA_Player_Stats
Winning <- Winning %>%
  group_by(TEAM) %>%
  mutate(total_ppg = sum(PPG))
Standings <- read_excel("C:/Users/mog2/Downloads/2019-2020 NBA Standings.xlsx")
Winning <- left_join(Winning,Standings, by = 'TEAM')
p5 <- plot_ly(Winning, x = ~total_ppg, y = ~`Win%`, type = 'scatter', mode = 'markers',
             marker = list(size = 5, opacity = 0.5),
             hoverinfo = 'text',
             text = ~paste('Team:', `TEAM`, '<br> Win%:', `Win%`, '<br> Total average PPG:', total_ppg)) %>%
  layout(title = 'Total average PPG versus win percentage')
p5
WinningLinear <- lm(total_ppg ~ `Win%`, data = Winning)
broom::tidy(WinningLinear)
## # A tibble: 2 x 5
##   term        estimate std.error statistic   p.value
##   <chr>          <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept)   135.        1.35    99.6   8.02e-315
## 2 `Win%`          2.06      2.51     0.822 4.11e-  1
broom::glance(WinningLinear)
## # A tibble: 1 x 11
##   r.squared adj.r.squared sigma statistic p.value    df logLik   AIC   BIC
##       <dbl>         <dbl> <dbl>     <dbl>   <dbl> <int>  <dbl> <dbl> <dbl>
## 1   0.00146     -0.000699  10.8     0.676   0.411     2 -1765. 3536. 3548.
## # ... with 2 more variables: deviance <dbl>, df.residual <int>
autoplot(WinningLinear)

cor(Winning$`total_ppg`, Winning$`Win%`)
## [1] 0.0381848

I wanted to look at some of the stats of the top 50 scoring NBA players and see how their other stats compared.

Top_20 <- NBA_Player_Stats %>%
  arrange(desc(PPG))
Top_20 <- Top_20[1:20,]
Top_20_d3heatmap <- Top_20 %>%
  select('FULL NAME',13:18)
row.names(Top_20_d3heatmap) <- Top_20_d3heatmap$`FULL NAME`
## Warning: Setting row names on a tibble is deprecated.
Top_20_d3heatmap$`FULL NAME` <- NULL
p2 <- d3heatmap(scale(Top_20_d3heatmap), dendrogram = 'none', color = "Blues")
p2

I was then curious to see if a player’s shooting percentage (2P%, 3P%, FT%) had any correlation with a player’s PPG. First, I filtered the initial dataset to only include players that had shot above the average number of shots. Then I graphed percentage versus PPG. The dot size and color represents the overall number of shots taken.

TwoPA <- NBA_Player_Stats %>%
  filter(`2PA` > mean(`2PA`))

ThreePA <- NBA_Player_Stats %>%
  filter(`3PA` > mean(`3PA`))

FTA <- NBA_Player_Stats %>%
  filter(`FTA` > mean(`FTA`))

p5 <- plot_ly(TwoPA, x = ~`2P%`, y = ~PPG, type = 'scatter', mode = 'markers', color = ~`2PA`, colors = 'Blues',
             marker = list(size = ~`2PA`/10, opacity = 0.5),
             hoverinfo = 'text',
             text = ~paste('Player:', `FULL NAME`, '<br> PPG:', PPG, '<br> 2PA:', `2PA`, '<br> 2P%:', `2P%`)) %>%
  layout(title = '2 pointer percentage versus PPG')
p5
p6 <- plot_ly(ThreePA, x = ~`3P%`, y = ~PPG, type = 'scatter', mode = 'markers', color = ~`3PA`, colors = 'Reds',
             marker = list(size = ~`3PA`/10, opacity = 0.5),
             hoverinfo = 'text',
             text = ~paste('Player:', `FULL NAME`, '<br> PPG:', PPG, '<br> 3PA:', `3PA`, '<br> 3P%:', `3P%`)) %>%
  layout(title = '3 pointer percentage versus PPG')
p6
p7 <- plot_ly(FTA, x = ~`FT%`, y = ~PPG, type = 'scatter', mode = 'markers', color = ~`FTA`, colors = 'Greens',
             marker = list(size = ~`FTA`/10, opacity = 0.5),
             hoverinfo = 'text',
             text = ~paste('Player:', `FULL NAME`, '<br> PPG:', PPG, '<br> FTA:', `FTA`, '<br> FT%:', `FT%`)) %>%
  layout(title = 'Free throw percentage versus PPG')
p7
TwoPALinear <- lm(`2P%` ~ PPG, data = TwoPA)
broom::tidy(TwoPALinear)
## # A tibble: 2 x 5
##   term         estimate std.error statistic  p.value
##   <chr>           <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)  0.523     0.0148      35.3   2.09e-84
## 2 PPG         -0.000497  0.000953    -0.522 6.02e- 1
broom::glance(TwoPALinear)
## # A tibble: 1 x 11
##   r.squared adj.r.squared  sigma statistic p.value    df logLik   AIC   BIC
##       <dbl>         <dbl>  <dbl>     <dbl>   <dbl> <int>  <dbl> <dbl> <dbl>
## 1   0.00146      -0.00391 0.0779     0.273   0.602     2   214. -422. -412.
## # ... with 2 more variables: deviance <dbl>, df.residual <int>
autoplot(TwoPALinear)

cor(TwoPA$`2P%`, TwoPA$`PPG`)
## [1] -0.03825054
ThreePALinear <- lm(`3P%` ~ `PPG`, data = ThreePA)
broom::tidy(ThreePALinear)
## # A tibble: 2 x 5
##   term         estimate std.error statistic  p.value
##   <chr>           <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept) 0.360      0.00895     40.2   2.46e-94
## 2 PPG         0.0000955  0.000607     0.157 8.75e- 1
broom::glance(ThreePALinear)
## # A tibble: 1 x 11
##   r.squared adj.r.squared  sigma statistic p.value    df logLik   AIC   BIC
##       <dbl>         <dbl>  <dbl>     <dbl>   <dbl> <int>  <dbl> <dbl> <dbl>
## 1  0.000132      -0.00519 0.0533    0.0247   0.875     2   288. -571. -561.
## # ... with 2 more variables: deviance <dbl>, df.residual <int>
autoplot(ThreePALinear)

cor(ThreePA$`3P%`, ThreePA$`PPG`)
## [1] 0.01146824
FTALinear <- lm(`FT%` ~ `PPG`, data = FTA)
broom::tidy(FTALinear)
## # A tibble: 2 x 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)  0.682     0.0216      31.6  3.31e-70
## 2 PPG          0.00604   0.00132      4.58 9.41e- 6
broom::glance(FTALinear)
## # A tibble: 1 x 11
##   r.squared adj.r.squared  sigma statistic p.value    df logLik   AIC   BIC
##       <dbl>         <dbl>  <dbl>     <dbl>   <dbl> <int>  <dbl> <dbl> <dbl>
## 1     0.117         0.112 0.0988      21.0 9.41e-6     2   144. -283. -274.
## # ... with 2 more variables: deviance <dbl>, df.residual <int>
autoplot(FTALinear)

cor(FTA$`FT%`, FTA$`PPG`)
## [1] 0.3422926

Conclusion: Overall, FT% seemed to have the largest correlation coefficient with PPG (0.34) whereas 2P% and 3P% did not. In addition, there was not really any correlation between total points per game scored by a team and their winning percentage.